
Overview

pipeline_utils.py is a centralized utilities module that provides shared constants, helper functions, and common configurations used across all EDL Pipeline scripts. This promotes code reusability, consistency, and easier maintenance.

Purpose

This utility module serves to:
  • Centralize Configuration: Single source of truth for common settings
  • Reduce Code Duplication: Shared functions used by multiple scripts
  • Improve Maintainability: Update headers/settings in one place
  • Reduce Blocking: User-agent rotation makes requests less likely to be flagged as automated

How Scripts Use It

Multiple pipeline scripts import and use these utilities:
import requests
from pipeline_utils import get_headers, BASE_DIR

# Example from fetch_all_indices.py
response = requests.post(url, json=payload, headers=get_headers(), timeout=15)

# Example from fetch_fundamental_data.py
headers = get_headers(include_origin=True)
Scripts Using This Module:
  • fetch_all_indices.py - Uses get_headers()
  • fetch_fundamental_data.py - Uses get_headers(include_origin=True)
  • fetch_technical_data.py - Uses get_headers()
  • Any future pipeline scripts requiring standardized headers

API Reference

Constants

BASE_DIR
str
Absolute path to the directory containing the EDL Pipeline scripts.
Usage:
output_path = os.path.join(BASE_DIR, "data", "output.json")
Value:
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
USER_AGENTS
list[str]
List of browser user-agent strings for rotation to avoid detection.
Contents:
  • Chrome on Windows 10
  • Chrome on macOS
  • Chrome on Linux
  • Firefox on Windows 10
  • Safari on macOS
Total Count: 5 user agents

Functions

get_headers()

Generates standard HTTP headers with a randomly selected user-agent for API requests.
include_origin
bool
default: False
Whether to include the Origin and Referer headers that a browser would send from the ScanX site.
  • False: Returns basic headers (Content-Type, User-Agent, Accept)
  • True: Adds Origin and Referer headers pointing to scanx.dhan.co
Returns: dict
A dictionary containing HTTP headers.
Return Structure (include_origin=False):
{
    "Content-Type": "application/json",
    "User-Agent": "<random user agent>",
    "Accept": "application/json, text/plain, */*"
}
Return Structure (include_origin=True):
{
    "Content-Type": "application/json",
    "User-Agent": "<random user agent>",
    "Accept": "application/json, text/plain, */*",
    "Origin": "https://scanx.dhan.co",
    "Referer": "https://scanx.dhan.co/"
}
Usage Examples:
import requests
from pipeline_utils import get_headers

# Simple API call
response = requests.post(url, json=payload, headers=get_headers())
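Because get_headers() builds a new dictionary on every call, a caller can add request-specific headers to the result without affecting other scripts. The sketch below illustrates this under stated assumptions: the get_headers() defined here is a stand-in copy (with a shortened USER_AGENTS list) so the snippet runs on its own, and headers_with() is a hypothetical helper that is not part of pipeline_utils.py.

```python
import random

# Stand-in for pipeline_utils so this snippet runs on its own.
# In a pipeline script, use: from pipeline_utils import get_headers
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def get_headers(include_origin=False):
    """Return standard API headers with a random User-Agent."""
    h = {
        "Content-Type": "application/json",
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "application/json, text/plain, */*",
    }
    if include_origin:
        h["Origin"] = "https://scanx.dhan.co"
        h["Referer"] = "https://scanx.dhan.co/"
    return h


def headers_with(extra, include_origin=False):
    """Hypothetical helper: base headers plus caller-supplied additions."""
    headers = get_headers(include_origin=include_origin)
    headers.update(extra)  # safe: get_headers() returned a fresh dict
    return headers


# Add one extra header on top of the origin-aware defaults.
headers = headers_with({"Accept-Language": "en-US"}, include_origin=True)
```

The extra header exists only in the dictionary returned to this caller; a later get_headers() call still returns just the standard keys.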

Source Code

"""
═══════════════════════════════════════════════════
  EDL Pipeline — Shared Utilities
  Centralizes common constants and helpers used
  across all fetch_*.py scripts.
═══════════════════════════════════════════════════
"""

import os
import random

# ── Base Directory (all scripts live here) ──
BASE_DIR = os.path.dirname(os.path.abspath(__file__))

# ── User Agents (rotated to avoid detection) ──
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
]


def get_headers(include_origin=False):
    """Return standard API headers with a random User-Agent."""
    h = {
        "Content-Type": "application/json",
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "application/json, text/plain, */*",
    }
    if include_origin:
        h["Origin"] = "https://scanx.dhan.co"
        h["Referer"] = "https://scanx.dhan.co/"
    return h

Design Decisions

Some APIs throttle or block requests based on the User-Agent string. Choosing a browser user-agent at random for each call makes the pipeline's requests look less uniform, which may reduce the likelihood of being flagged or blocked.
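Note that "rotation" here means an independent random.choice() on every call, not a round-robin cycle, so the same user-agent can appear on consecutive requests. The following sketch illustrates that behaviour; POOL holds placeholder labels standing in for the five real USER_AGENTS strings, and the fixed seed exists only to make the demo repeatable.

```python
import random
from collections import Counter

# Placeholder labels standing in for the five USER_AGENTS strings.
POOL = [
    "chrome-windows",
    "chrome-macos",
    "chrome-linux",
    "firefox-windows",
    "safari-macos",
]

random.seed(0)  # fixed seed only so repeated runs of this demo match

# Simulate the user-agent drawn for 1000 consecutive get_headers() calls.
picks = [random.choice(POOL) for _ in range(1000)]
counts = Counter(picks)

# Count how often two back-to-back calls drew the same entry.
repeats = sum(1 for a, b in zip(picks, picks[1:]) if a == b)

for name, n in counts.most_common():
    print(f"{name}: {n}")
print(f"consecutive repeats: {repeats}")
```

If strictly alternating user-agents were ever required, itertools.cycle over the list would be one alternative, at the cost of a predictable sequence.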
Some endpoints check the Origin and Referer headers that browsers attach to cross-origin requests, while others do not. The include_origin parameter lets callers add these headers only where they are needed, without duplicating the header-building code.
Having a centralized base directory path allows scripts to reference relative paths consistently, making the pipeline portable across different environments without hardcoded paths.
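As a rough illustration of building paths from a single base directory, the sketch below joins path parts under a base and creates the parent folder on demand. The output_path() helper is hypothetical and not part of pipeline_utils.py; a temporary directory stands in for BASE_DIR so the snippet runs anywhere.

```python
import os
import tempfile


def output_path(base_dir, *parts):
    """Hypothetical helper: join parts under base_dir, creating the parent folder."""
    path = os.path.join(base_dir, *parts)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path


# In a pipeline script, base_dir would be BASE_DIR from pipeline_utils.
with tempfile.TemporaryDirectory() as base_dir:
    path = output_path(base_dir, "data", "output.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write("{}")
    written = os.path.exists(path)  # checked before the temp dir is removed
```

Because every path is derived from the base directory rather than the current working directory, the script behaves the same regardless of where it is launched from.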

Best Practices

Always Import from pipeline_utils

Instead of hardcoding headers in each script, always import get_headers() to ensure consistency and benefit from updates.
Good:
from pipeline_utils import get_headers
headers = get_headers()
Avoid:
headers = {
    "Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0..."
}

Use include_origin for ScanX API

When calling ScanX endpoints (dhan.co domains) that check the request origin, use include_origin=True so that the Origin and Referer headers are sent.
headers = get_headers(include_origin=True)

Extend, Don't Duplicate

If you need additional utility functions, add them to pipeline_utils.py instead of creating separate utility files.

Dependencies

  • os: File path operations
  • random: Random selection of user agents
This module has no external dependencies beyond the Python standard library, making it lightweight and portable.

Future Enhancements

Potential additions to this utility module:
  • Centralized logging configuration
  • Retry logic with exponential backoff
  • Environment variable management
  • Common data validation functions
  • Shared file I/O helpers
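To make one of these ideas concrete, here is a minimal sketch of retry logic with exponential backoff using only the standard library. It is not part of the current module: retry_with_backoff(), its parameters, and the flaky() test function are all hypothetical, and a real version would likely catch narrower exception types than Exception.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt, then retry.

    Re-raises the last exception once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # assumption: a real helper would catch specific errors
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo: a function that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"


# Record the requested delays instead of actually sleeping.
delays = []
result = retry_with_backoff(flaky, sleep=delays.append)
```

Passing the sleep function as a parameter keeps the helper easy to test, since the delays can be recorded rather than waited out.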